System-Level Implications of Processor-Memory Integration

Author

  • Doug Burger

Abstract

In this paper, we address the system-level implications of processor/memory integration. Specifically, we explore the effects that very large on-processor memories will have upon both the memory hierarchy as a whole and the processor organization. Our focus is on the migration of memory to the processor, not the migration of inexpensive processors onto commodity DRAM parts (the feasibility of the latter model in the market is still an unanswered question). Using cost/performance models coupled with simulation results, we compare three simple on-chip memory organizations (cache, fraction of main memory, and a hybrid of the two). We then examine the constraints under which all of the main memory may migrate onto the processor, thus enabling IRAM-based systems. Finally, we discuss the implications that large on-processor memories have for chip multiprocessors (CMPs) and appropriate uses for the multiple on-chip processors.

1 Implications of large on-processor memories

The continuing exponential growth in microprocessor performance and die real estate, coupled with the growing gap between processor and stock DRAM performance, is making the performance of the memory hierarchy the key determinant of overall system performance. The rising interest in IRAM chips, which combine a processor and physical memory on a single die, reflects the growing importance of the memory hierarchy in system design. IRAM chips have been proposed [2, 10, 13] as a cost-effective way to improve memory bandwidth and reduce memory latency, as opposed to the current conventional approach of multiple levels of expensive caches and high-performance inter-chip buses.

The complete integration of processors and main memory, if it happens, could take one of two paths, and perhaps both. Successively larger memories will be placed onto the processor (or into the processor package), until the entire physical memory can fit within the processor package. Alternatively, commodity DRAM manufacturers may begin placing small, inexpensive processors on the DRAM die; over time, these processors could become powerful enough to obviate the need for a large, central processor in the system. Both directions may occur simultaneously, of course, with the central processor aggregating memory while commodity DRAM parts gain limited "intelligence" (either a small general-purpose processor or PIM-like logic [5]). While there is excitement in the community about both directions, the "memory to the processor" alternative is much less revolutionary, and it is on this alternative that we focus in this work.

Large on-processor memories are a near-certainty in the future. In Figure 1a, we plot the recent growth of main memory sizes, the historical and projected [15] increases in the number of microprocessor transistors (Intel x86), and the historical and projected increase in bits per DRAM die. The solid line in Figure 1a represents a least-mean-squares regression of the existing and projected microprocessor transistor counts. The projected growth of microprocessor transistors remains stable, doubling approximately every 18 months with no slowdown of exponential growth. In Figure 1b, we plot the percentage of processor transistors devoted to in-package cache memory for a range of microprocessors. The lines represent LMS regressions for a few of the processor families. While it is impossible to extrapolate quantitatively from these numbers, the trend is clear: a growing percentage of microprocessor transistor budgets, reaching 85% in some cases, is allocated to cache memory.
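As a rough illustration of the kind of trend fit behind Figure 1a, the minimal Python sketch below performs a least-squares regression on log-transformed transistor counts and derives a doubling time. The sample data points and the projection year are assumed placeholders, not the historical/projected data set used in the paper.

    import math

    # Assumed, illustrative (year, transistor-count) points for Intel x86 parts;
    # not the data plotted in Figure 1a.
    samples = [
        (1978, 29_000),
        (1985, 275_000),
        (1993, 3_100_000),
        (1997, 7_500_000),
    ]

    years = [year for year, _ in samples]
    logs = [math.log2(count) for _, count in samples]
    n = len(samples)

    # Ordinary least-squares fit of log2(transistors) against year.
    mean_year = sum(years) / n
    mean_log = sum(logs) / n
    slope = (sum((y - mean_year) * (l - mean_log) for y, l in zip(years, logs))
             / sum((y - mean_year) ** 2 for y in years))
    intercept = mean_log - slope * mean_year

    # Doubling time in years; the paper's combined historical and projected
    # counts put this at roughly 1.5 years (18 months).
    print(f"estimated doubling time: {1.0 / slope:.1f} years")

    # Extrapolate the fitted trend one decade past the 1997 publication date.
    print(f"projected transistors in 2007: {2 ** (intercept + slope * 2007):.2e}")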
A qualitative extrapolation implies that future microprocessors, with their vast numbers of on-chip transistors, will be mostly memory.

The rest of this paper is organized as follows. In Section 2, we explore the effects that large on-processor memories will have on the memory hierarchy. Specifically, we quantify the constraints under which the on-chip memory will be treated as a cache or as a fast fraction of the physical memory. We present a performance model
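As a hedged illustration of the style of cost/performance comparison described above (not the paper's actual model), the following sketch contrasts the cache and memory-fraction organizations with a simple average-memory-access-time calculation. The function names, latencies, hit rate, and resident fraction are assumed values chosen for illustration; the hybrid organization would simply combine the two terms.

    # Minimal sketch: average memory access time (AMAT) under two on-chip
    # memory organizations. All parameters below are assumed placeholders.

    def amat_cache(hit_rate, on_chip_ns, miss_penalty_ns):
        """On-chip memory managed as a cache: every access probes the cache,
        and misses pay an additional off-chip DRAM penalty."""
        return on_chip_ns + (1.0 - hit_rate) * miss_penalty_ns

    def amat_memory_fraction(resident_fraction, on_chip_ns, off_chip_ns):
        """On-chip memory managed as a fast fraction of physical memory:
        accesses to data resident on chip are fast, the rest go off chip."""
        return resident_fraction * on_chip_ns + (1.0 - resident_fraction) * off_chip_ns

    # Rough circa-1997 magnitudes, assumed for illustration only.
    print(f"cache organization:           {amat_cache(0.97, 10, 120):.1f} ns")
    print(f"memory-fraction organization: {amat_memory_fraction(0.90, 10, 130):.1f} ns")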

Publication date: 1997